Exploratory Data Analysis

Analyzing the NYC Airbnb dataset for price prediction.

In [1]:
%matplotlib inline
import wandb
import pandas as pd
import matplotlib.pyplot as plt

Load data from W&B

In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)
wandb: Currently logged in as: danieludacity (danieludacity-udacity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
Tracking run with wandb version 0.22.3
Run data is saved locally in /home/daniel/DeepAI-Learn/Misc/ML_Workflow/build-ml-pipeline-for-short-term-rental-prices/src/eda/wandb/run-20251030_125644-vw18h74f
Syncing run ghastly-trouble-26 to Weights & Biases (docs)
View project at https://wandb.ai/danieludacity-udacity/nyc_airbnb
View run at https://wandb.ai/danieludacity-udacity/nyc_airbnb/runs/vw18h74f
In [3]:
df.head()
Out[3]:
  | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365
0 | 9138664 | Private Lg Room 15 min to Manhattan | 47594947 | Iris | Queens | Sunnyside | 40.74271 | -73.92493 | Private room | 74 | 2 | 6 | 2019-05-26 | 0.13 | 1 | 5
1 | 31444015 | TIME SQUARE CHARMING ONE BED IN HELL'S KITCHEN... | 8523790 | Johlex | Manhattan | Hell's Kitchen | 40.76682 | -73.98878 | Entire home/apt | 170 | 3 | 0 | NaN | NaN | 1 | 188
2 | 8741020 | Voted #1 Location Quintessential 1BR W Village... | 45854238 | John | Manhattan | West Village | 40.73631 | -74.00611 | Entire home/apt | 245 | 3 | 51 | 2018-09-19 | 1.12 | 1 | 0
3 | 34602077 | Spacious 1 bedroom apartment 15min from Manhattan | 261055465 | Regan | Queens | Astoria | 40.76424 | -73.92351 | Entire home/apt | 125 | 3 | 1 | 2019-05-24 | 0.65 | 1 | 13
4 | 23203149 | Big beautiful bedroom in huge Bushwick apartment | 143460 | Megan | Brooklyn | Bushwick | 40.69839 | -73.92044 | Private room | 65 | 2 | 8 | 2019-06-23 | 0.52 | 2 | 8
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20000 non-null  int64  
 1   name                            19993 non-null  object 
 2   host_id                         20000 non-null  int64  
 3   host_name                       19992 non-null  object 
 4   neighbourhood_group             20000 non-null  object 
 5   neighbourhood                   20000 non-null  object 
 6   latitude                        20000 non-null  float64
 7   longitude                       20000 non-null  float64
 8   room_type                       20000 non-null  object 
 9   price                           20000 non-null  int64  
 10  minimum_nights                  20000 non-null  int64  
 11  number_of_reviews               20000 non-null  int64  
 12  last_review                     15877 non-null  object 
 13  reviews_per_month               15877 non-null  float64
 14  calculated_host_listings_count  20000 non-null  int64  
 15  availability_365                20000 non-null  int64  
dtypes: float64(3), int64(7), object(6)
memory usage: 2.4+ MB

Generate profile report

In [5]:
import ydata_profiling
profile = ydata_profiling.ProfileReport(df)
profile.to_notebook_iframe()  # widgets don't work in this environment, so render the report in an iframe

Observations

From the profile report and initial inspection:

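  • Several columns have missing values: name (7), host_name (8), and last_review and reviews_per_month (4,123 each)
  • last_review is stored as a string (object dtype) and should be converted to datetime
  • price has outliers at both ends, from $0 up to $10,000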
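
The missing-value counts can also be read off directly instead of from the profile report. The following is a minimal sketch, assuming a small stand-in frame (`toy`, with made-up values) rather than the real sample; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values) reusing some of the sample's column names.
toy = pd.DataFrame(
    {
        "name": ["Sunny room", None, "Loft"],
        "last_review": ["2019-05-26", np.nan, "2018-09-19"],
        "reviews_per_month": [0.13, np.nan, 1.12],
        "price": [74, 170, 245],
    }
)

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `df` instead of `toy` should list the columns flagged above.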

Based on stakeholder input, a reasonable price range is $10 to $350.
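
Before filtering, it can help to check how many rows the range would remove. This is a rough sketch on hypothetical prices (the `prices` series is invented for illustration); the bounds are the ones given by the stakeholders.

```python
import pandas as pd

# Hypothetical prices, chosen to include values on and outside the boundaries.
prices = pd.Series([0, 10, 105, 350, 351, 10000], name="price")

min_price, max_price = 10, 350  # range given by stakeholders

# Series.between is inclusive on both ends by default,
# so listings priced at exactly 10 or 350 are kept.
in_range = prices.between(min_price, max_price)

n_dropped = int((~in_range).sum())
share_dropped = n_dropped / len(prices)
print(f"{n_dropped} of {len(prices)} rows fall outside the range ({share_dropped:.1%})")
```

Applied to the full sample, the same mask is what reduces the frame from 20,000 to 19,001 rows in the cleaning step.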

Analyze price distribution

In [6]:
df['price'].describe()
Out[6]:
count    20000.000000
mean       153.269050
std        243.325609
min          0.000000
25%         69.000000
50%        105.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
In [7]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(df['price'], bins=50)
ax[0].set_title('Price Distribution')
ax[1].boxplot(df['price'])
ax[1].set_title('Price Boxplot')
fig
Out[7]:
[Figure: two panels, "Price Distribution" (histogram, 50 bins) and "Price Boxplot"]

Clean the data

In [8]:
# Remove price outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
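
The datetime conversion does not fill in or drop missing review dates; they stay missing as NaT. A minimal sketch of that behavior, assuming a hypothetical three-element stand-in for the last_review column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for last_review: two date strings and one missing value.
last_review = pd.Series(["2019-05-26", np.nan, "2018-09-19"], name="last_review")

converted = pd.to_datetime(last_review)

# The column now has a datetime dtype, and the missing entry is kept as NaT.
print(pd.api.types.is_datetime64_any_dtype(converted))
print(converted.isna().sum())
```

This is why last_review still has fewer non-null entries than the other columns after cleaning.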

Verify cleaned data

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  number_of_reviews               19001 non-null  int64         
 12  last_review                     15243 non-null  datetime64[ns]
 13  reviews_per_month               15243 non-null  float64       
 14  calculated_host_listings_count  19001 non-null  int64         
 15  availability_365                19001 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(7), object(5)
memory usage: 2.5+ MB
In [10]:
df['price'].describe()
Out[10]:
count    19001.000000
mean       122.340456
std         71.530346
min         10.000000
25%         66.000000
50%        100.000000
75%        160.000000
max        350.000000
Name: price, dtype: float64
In [11]:
plt.hist(df['price'], bins=50)
plt.title('Price Distribution (Cleaned)')
plt.show()
[Figure: histogram, "Price Distribution (Cleaned)" (50 bins)]
In [ ]:
run.finish()